A Computationally Efficient Implementation of
Fictitious Play for Large-Scale Games 

BRIAN SWENSON†*, SOUMMYA KAR† AND JOAO XAVIER*

Abstract 

The paper is concerned with distributed learning and optimization in large-scale settings. The
well-known Fictitious Play (FP) algorithm has been shown to achieve Nash equilibrium learning in
certain classes of multi-agent games. However, FP can be computationally difficult to implement
when the number of players is large. Sampled FP is a variant of FP that mitigates the computational
difficulties arising in FP by using a Monte-Carlo (i.e., sampling-based) approach. The Sampled FP
algorithm has been studied both as a tool for distributed learning and as an optimization heuristic
for large-scale problems. Despite its computational advantages, a shortcoming of Sampled FP is that
the number of samples that must be drawn in each round of the algorithm grows without bound (on the
order of $\sqrt{t}$, where $t$ is the round of the repeated play). In this paper we propose
Computationally Efficient Sampled FP (CESFP), a variant of Sampled FP in which only one sample need
be drawn each round of the algorithm (a substantial reduction from the $O(\sqrt{t})$ samples per
round required in Sampled FP). CESFP operates using a stochastic-approximation type rule to estimate
the expected utility from round to round. It is proven that the CESFP algorithm achieves Nash
equilibrium learning in the same sense as classical FP and Sampled FP. Simulation results suggest
that the convergence rate of CESFP (in terms of repeated-play iterations) is similar to that of
Sampled FP.


I. Introduction 

A game-theoretic learning algorithm is an adaptive multi-agent procedure which can enable
a system of interacting agents to achieve desirable global behavior using local (agent-based)
control laws. Such algorithms have a wide range of applications in distributed control [1]-[5]
and large-scale optimization [6]-[10].

Fictitious Play (FP) [11] is an archetypal game-theoretic learning algorithm that has received
much attention over the years due to its intuitively simple nature and proven convergence results
for certain important classes of games (see, for example, [12], [13] and references therein). Of
particular interest are results demonstrating that FP leads players^1 to learn equilibrium strategies
in potential games, a class of multi-agent games in which there may be an arbitrarily large
number of players [14], [15].

However, such convergence results tend to be of limited practical value due to computational
difficulties that may arise when implementing FP in large games. In particular, in each stage of
the FP algorithm, each player $i$ must compute the expected (mixed) utility for each of her actions
given her current beliefs regarding opponents' strategies. Evaluating this expected utility (the
domain of which is an $(n-1)$-dimensional probability simplex) is a problem whose complexity
in general scales exponentially in terms of the number of players, $n$.

The main focus of the present paper is the presentation of a variant of FP that might be
more practical to implement in certain large-scale settings. In particular, we consider a practical
method for mitigating computational complexity using a Monte-Carlo type approach.

Sampled FP [9], [10], [16]-[19] introduced the idea of mitigating complexity in FP using a
Monte-Carlo (i.e., sampling-based) approach. At each iteration of the Sampled FP algorithm,
players approximate the expected utility by drawing several samples from an underlying
probability distribution. Players then myopically choose an “optimal” next-stage action using the
approximated utility as a surrogate for the true expected utility. The work [9] showed that, as
long as the number of samples drawn each round grows sufficiently quickly, players learn an
equilibrium in the same sense as FP, almost surely (a.s.).

In essence, Sampled FP achieves a mitigation in complexity by avoiding any direct evaluation
of the expected utility. However, Sampled FP has a notable shortcoming: in order to guarantee
learning is achieved, the number of samples that must be drawn in each iteration (i.e., round)
of the algorithm grows without bound (on the order of $\sqrt{t}$ samples per round, where $t$ is the
current round of the repeated play algorithm).


^1 We use the terms agent and player interchangeably throughout the paper.

In this paper, we propose a variant of Sampled FP, which we call Computationally Efficient
Sampled FP (CESFP), in which only one sample need be drawn each round of the repeated
play process. CESFP achieves the same fundamental computational advantage as Sampled FP
(i.e., direct evaluation of the expected utility is avoided) but does so by drawing only one sample
per round (rather than the $O(\sqrt{t})$ samples per round required in Sampled FP).

Intuitively, the reduction in the required number of per-round samples is accomplished by
treating the expected utility process as quasi static. Such treatment is possible due to the
diminishing incremental step size in the expected utility process. In CESFP, the sample data
gathered in the current round of repeated play is recursively combined with sample data from
previous rounds using a stochastic-approximation-type estimation rule. This may be contrasted
with Sampled FP where, in each round of the repeated play, data gathered from sampling in the
previous round is wholly discarded and a fresh set of samples is gathered to approximate the
expected utility for the upcoming round. (See Section V-C for more details.)

Due to the improved efficiency in information handling, CESFP is able to achieve convergence
at a rate similar to that of Sampled FP (in terms of repeated-play iterations) despite drawing far
fewer samples per iteration. (See Section VI for more details.)

The main contribution of the paper is the presentation of the CESFP algorithm and proof of
convergence of the algorithm in terms of empirical frequency to the set of Nash equilibria (a.s.).
The proof relies on showing that the CESFP process may be seen as a Generalized Weakened
FP process as studied in [20].

CESFP may be applicable as a computationally efficient variant of FP in a variety of settings
including large-scale optimization [?], [?], dynamic programming [?], [?], traffic routing
[?], cognitive radio [?], [?], [?], and learning in Markov decision processes [19]. CESFP
may also be used as a general tool for distributed learning [?] or control [1].

Related works have studied approaches for mitigating computational issues arising in
large-scale implementations of FP. Joint Strategy FP (JSFP) [24] studies a variant of FP in which
players update a utility estimate using a computationally simple recursive procedure and choose
next-stage actions using a best-response rule combined with an inertial term. JSFP is shown to
converge to pure strategy Nash equilibria (NE) in ordinal potential games but is fundamentally
different from CESFP (and FP) in that the tracked utility corresponds to an empirical distribution
taken over joint actions^2 and convergence may only occur at pure NE.

Payoff-based learning algorithms, including those based on FP [20], [25]-[27] and otherwise
[28]-[30], tend to be computationally simple in large games and have the further advantage that
they do not require players to have any knowledge of the game's utility structure. However, such
algorithms implicitly assume players have access to instantaneous payoff information and may
not be applicable in settings where this information is costly to obtain, delayed, or otherwise
unavailable.

For example, in a follow-up paper [23] we consider an application in which CESFP is
implemented in a network-based setting in which all inter-agent communication is restricted
to a preassigned (possibly sparse) communication graph [31]. Instantaneous payoff information
can be difficult to obtain in such a setting, particularly in the case that the utility corresponds to
a non-local welfare-type utility function, which may not be physically measurable at any single
agent. Furthermore, there are circumstances in which each round of physical game play can incur
an exogenous cost. In such cases it may be preferable to supplement payoff-based learning
(which depends on interaction in the physical environment) with forms of model-based learning.

The remainder of the paper is organized as follows. Section II sets up the notation to be used
in the subsequent development. Section III reviews classical FP. Section IV reviews Sampled
FP. Section V presents the CESFP algorithm, states the main convergence result for CESFP, and
proves the result. Section VI presents a simulation example comparing Sampled FP and CESFP.
Section VII provides concluding remarks.

II. Preliminaries 

A game in normal form is represented by the tuple $\Gamma := (N, (Y_i, u_i)_{i \in N})$, where
$N = \{1, \ldots, n\}$ denotes the set of players, $Y_i$ denotes the finite set of actions available
to player $i$, and $u_i : \prod_{j \in N} Y_j \to \mathbb{R}$ denotes the utility function of player
$i$. Denote by $Y := \prod_{i \in N} Y_i$ the joint action space.

In order to guarantee the existence of Nash equilibria it is necessary to consider the mixed
extension of $\Gamma$ in which players are permitted to play probabilistic strategies. Let
$m_i := |Y_i|$ be the cardinality of the action space of player $i$, and let
$\Delta_i := \{p \in \mathbb{R}^{m_i} : \sum_{k=1}^{m_i} p(k) = 1,\ p(k) \geq 0,\ \forall k\}$
denote the set of mixed strategies available to player $i$; note that a mixed strategy is a
probability distribution over the action space of player $i$. Denote by
$\Delta^n := \prod_{i \in N} \Delta_i$ the set of joint mixed strategies. When convenient, we
represent a mixed strategy $p \in \Delta^n$ by $p = (p_i, p_{-i})$, where $p_i$ denotes the marginal
strategy of player $i$ and $p_{-i}$ is an $(n-1)$-tuple containing the marginal strategies of the
other players.

^2 In FP, Sampled FP, and CESFP, players best respond to the product of marginal empirical distributions (or an estimate thereof),
which implicitly presumes a form of independence among opponents' strategies. Tracking and responding to the empirical
distribution of joint actions, as in JSFP, fundamentally alters the dynamics of classical FP. CESFP achieves computational
efficiency while preserving the basic dynamical structure of FP.

In the context of mixed strategies, we often wish to retain the notion of playing a single
deterministic action. For this purpose, let $A_i := \{e_1, \ldots, e_{m_i}\}$ denote the set of
“pure strategies” of player $i$, where $e_j$ is the $j$-th canonical vector containing a 1 at
position $j$ and zeros otherwise. Note that there is a one-to-one correspondence between a player's
action set $Y_i$ and the player's set of pure strategies $A_i \subset \Delta_i$.

The mixed utility function of player $i$ is given by

$$U_i(p) := \sum_{y \in Y} u_i(y)\, p_1(y_1) \cdots p_n(y_n), \tag{1}$$

where $U_i : \Delta^n \to \mathbb{R}$. Note that the mixed utility $U_i(p)$ may be interpreted as
the expected value of $u_i(y)$ given that players' (marginal) mixed strategies $p_i$ are independent.
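
As a concrete illustration of (1), the following minimal sketch evaluates the mixed utility by direct enumeration of the joint action space. The function name `mixed_utility` and the two-player coordination game are hypothetical choices made only for illustration. The loop visits one term per joint action, which is the source of the cost discussed later when $n$ is large.

```python
import itertools
import numpy as np

def mixed_utility(u, p):
    """Evaluate U_i(p) = sum_y u_i(y) * p_1(y_1) * ... * p_n(y_n) by enumeration.

    u : array of shape (m_1, ..., m_n) holding u_i(y) for every joint action y
    p : list of n probability vectors, p[j] having length m_j
    """
    total = 0.0
    # One term per joint action: prod_j m_j terms in all.
    for y in itertools.product(*(range(len(pj)) for pj in p)):
        prob = 1.0
        for j, yj in enumerate(y):
            prob *= p[j][yj]
        total += u[y] * prob
    return total

# Hypothetical 2-player coordination game: payoff 1 if the actions match, else 0.
u = np.eye(2)
p = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
value = mixed_utility(u, p)  # 0.5 * 0.25 + 0.5 * 0.75 = 0.5
```

With $n$ players and $m$ actions each, the loop runs $m^n$ times, so this direct approach is only practical for small games.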

The set of Nash equilibria is given by
$NE := \{p \in \Delta^n : U_i(p_i, p_{-i}) \geq U_i(p_i', p_{-i}),\ \forall p_i' \in \Delta_i,\ \forall i \in N\}$.
The distance of a distribution $p \in \Delta^n$ from a set $S \subset \Delta^n$ is given by
$d(p, S) = \inf\{\|p - p'\| : p' \in S\}$. Throughout the paper $\|\cdot\|$ denotes the standard
$\mathcal{L}_2$ Euclidean norm unless otherwise specified.

Throughout, we assume there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ rich
enough to carry out the construction of the various random variables required in the paper. For a
random object $X$ defined on a measurable space $(\Omega, \mathcal{F})$, let $\sigma(X)$ denote the
$\sigma$-algebra generated by $X$ [32]. As a matter of convention, all equalities and inequalities
involving random objects are to be interpreted almost surely (a.s.) with respect to the underlying
probability measure, unless otherwise stated.

A. Repeated Play 

The learning algorithms considered in this paper all assume the following format of repeated
play. Let a normal form game $\Gamma$ be fixed. Let players repeatedly face off in the game
$\Gamma$, and for $t \in \{1, 2, \ldots\}$, let $a_i(t) \in A_i$ denote the action played by player
$i$ in round $t$. Let the $n$-tuple $a(t) = (a_1(t), \ldots, a_n(t))$ denote the joint action at
time $t$.

Let the empirical history distribution (or empirical distribution) of player $i$ be given by^3
$q_i(t) := \frac{1}{t} \sum_{s=1}^{t} a_i(s)$, and let the joint empirical distribution be given by
the $n$-tuple $q(t) = (q_1(t), \ldots, q_n(t))$.


III. Fictitious Play 

An FP process may be intuitively described as follows: A finite set of $n$ agents engage in
repeated play of some fixed normal form game. Each round of the repeated play, each agent $i$
plays an action that is myopically optimal under the (naive) assumption that all opponents are
playing according to time-invariant and statistically independent strategies. In particular, under
this assumption, each player $i$ believes that the empirical distribution $q_{-i}(t)$ of opponents'
play is an accurate representation of opponents' (supposedly time-invariant) strategies and chooses
a next-stage action that optimizes her utility given this belief.

A formal description of the FP algorithm is given below. 

A. FP algorithm

Initialize

(i) Each player $i$ chooses an arbitrary initial action $a_i(1) \in A_i$. The empirical distribution is
initialized as $q_i(1) = a_i(1)$, $\forall i$.

Iterate ($t \geq 1$)

(ii) Each player $i$ chooses her next-stage action as a best response to the current empirical
distribution of opponents' play:

$$a_i(t+1) \in \arg\max_{\alpha_i \in A_i} U_i(\alpha_i, q_{-i}(t)). \tag{2}$$

(iii) For each player $i$, the empirical distribution is updated to reflect the action just taken,
$q_i(t+1) = \frac{1}{t+1} \sum_{s=1}^{t+1} a_i(s)$, or equivalently in recursive form:

$$q_i(t+1) = q_i(t) + \frac{1}{t+1}\big(a_i(t+1) - q_i(t)\big).$$
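
Steps (i)-(iii) can be sketched in a few lines for a small game. This is a minimal illustration rather than a reference implementation: the helper names, the brute-force evaluation of the mixed utility, and the two-player identical-interests payoff matrix are all assumptions introduced here.

```python
import itertools
import numpy as np

def mixed_utility_of_action(u_i, i, action, q):
    """U_i(alpha_i, q_{-i}): expected utility of a pure action of player i
    when each opponent j plays the mixed strategy q[j] independently."""
    n = len(q)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for y_others in itertools.product(*(range(len(q[j])) for j in others)):
        y = [0] * n
        y[i] = action
        prob = 1.0
        for j, yj in zip(others, y_others):
            y[j] = yj
            prob *= q[j][yj]
        total += u_i[tuple(y)] * prob
    return total

def fictitious_play(u, init_actions, rounds):
    """Run FP. u[i] is player i's utility array; returns q(rounds)."""
    n = len(u)
    sizes = u[0].shape
    # Step (i): q_i(1) = a_i(1), stored as a canonical (one-hot) vector.
    q = [np.eye(sizes[i])[init_actions[i]] for i in range(n)]
    for t in range(1, rounds):
        # Step (ii): best response to the current empirical distribution.
        best = [int(np.argmax([mixed_utility_of_action(u[i], i, a, q)
                               for a in range(sizes[i])]))
                for i in range(n)]
        # Step (iii): q_i(t+1) = q_i(t) + (a_i(t+1) - q_i(t)) / (t+1).
        for i in range(n):
            q[i] = q[i] + (np.eye(sizes[i])[best[i]] - q[i]) / (t + 1)
    return q

# Hypothetical identical-interests game: both players receive 1 for matching on
# action 0, 2 for matching on action 1, and 0 otherwise.
payoff = np.array([[1.0, 0.0], [0.0, 2.0]])
q = fictitious_play([payoff, payoff], init_actions=[0, 1], rounds=50)
```

In this toy run the players start miscoordinated, settle on action 1 from the third round onward, and the empirical distributions move toward the pure equilibrium in which both play action 1.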

^3 Note that each $a_i(t) \in A_i$ is a delta distribution, and thus the empirical distribution $q_i(t)$ is a normalized histogram of the
action choices of player $i$.

B. Discussion

Various works have established results characterizing classes of games in which FP does
(or does not) lead players to learn NE strategies (see monographs [12], [13]). Of particular
relevance to the large-scale setting is a class of multi-agent games known as potential games
[?]. A game $\Gamma = (N, (Y_i, u_i)_{i \in N})$ is said to be a potential game if there exists a
potential function $\phi : Y \to \mathbb{R}$ such that for all $i \in N$, and all $y_{-i} \in Y_{-i}$,

$$u_i(y_i, y_{-i}) - u_i(x_i, y_{-i}) = \phi(y_i, y_{-i}) - \phi(x_i, y_{-i}), \quad \forall y_i, x_i \in Y_i.$$

Intuitively, the existence of a potential function means that all players' utility functions are
aligned in such a way that players share a common underlying objective. It has been shown [14],
[15] that if $\Gamma$ is a potential game, then FP leads players to learn NE strategies in the sense
that $\lim_{t \to \infty} d(q(t), NE) = 0$.

C. Computational Complexity in large-scale FP

While FP is theoretically proven to achieve NE learning in potential games, it can be
computationally difficult to implement when the number of players is large. In particular, note
that in order to choose a next-stage action (see (2)), player $i$ must compute the mixed utility
$U_i(\alpha_i, q_{-i}(t))$, $\forall \alpha_i \in A_i$. Recalling the definition of mixed utility (1),
this is equivalent to computing an expected value over an $(n-1)$-dimensional probability simplex.
In general, the complexity of this computation grows exponentially in terms of the number of players.

IV. Sampled FP

In order to mitigate the problem of computational complexity in FP, [9] proposed Sampled
FP. In Sampled FP, players use a Monte-Carlo approach to avoid direct evaluation of the mixed
utility when choosing a next-stage action.

In particular, for each $\alpha_i \in A_i$, let $\hat{U}_i(\alpha_i, t)$ denote an estimate that
player $i$ forms of the mixed utility $U_i(\alpha_i, q_{-i}(t))$. Each round of play, for each
player $i$, several “test actions” are drawn as random samples from opponents' joint empirical
distribution $q_{-i}(t)$. For each action $\alpha_i \in A_i$, player $i$ computes the average
utility the action $\alpha_i$ would generate given the randomly sampled “test actions.” Player $i$
then chooses a next-stage action that is myopically optimal using the estimated utility as a
surrogate for the true mixed utility in (2). So long as the number of samples increases
sufficiently quickly, it can be shown that Sampled FP leads players to learn NE strategies.

Formally, let $k_t$ denote the number of samples drawn in round $t$. The following assumption
on $k_t$ is sufficient to ensure learning is achieved.

A.1. The number of samples drawn in round $t$ satisfies $k_t = \lceil C t^{\gamma} \rceil$, where $\gamma > 1/2$ and $C > 0$.

The Sampled FP algorithm is outlined below. 


A. Sampled FP Algorithm 


Initialize 

(i) Each player $i$ chooses an arbitrary initial action $a_i(1) \in A_i$. The empirical distribution is
initialized as $q_i(1) = a_i(1)$, $\forall i$.

Iterate ($t \geq 1$)

(ii) $k_t$ “test actions” are drawn as random samples from the joint empirical distribution $q(t)$; let
$a^s(t)$ denote the $s$-th random sample drawn in round $t$. Player $i$ estimates the utility of each of
her actions $\alpha_i \in A_i$ by^4

$$\hat{U}_i(\alpha_i, t) = \frac{1}{k_t} \sum_{s=1}^{k_t} U_i(\alpha_i, a^s_{-i}(t)). \tag{3}$$

(iii) Each player $i$ chooses a next-stage action that is a best response given her estimate of the
mixed utility:

$$a_i(t+1) \in \arg\max_{\alpha_i \in A_i} \hat{U}_i(\alpha_i, t). \tag{4}$$

(iv) The empirical distribution for each player $i$ is updated recursively to account for the action
just taken:

$$q_i(t+1) = q_i(t) + \frac{1}{t+1}\big(a_i(t+1) - q_i(t)\big).$$
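
The sampling step (ii) and the estimate (3) can be sketched as follows. The function name `sampled_utility_estimate`, the payoff matrix, and the sample count are hypothetical and chosen only to make the example small. A joint test action is drawn from $q(t)$ and player $i$'s own component is overridden by each candidate action in turn.

```python
import numpy as np

def sampled_utility_estimate(u_i, i, q, k, rng):
    """Estimate U_i(alpha_i, q_{-i}) for every action alpha_i of player i
    by averaging u_i over k joint test actions drawn from q, as in (3)."""
    n = len(q)
    m_i = len(q[i])
    est = np.zeros(m_i)
    for _ in range(k):
        # Draw one joint test action; the components are independent draws.
        sample = [int(rng.choice(len(q[j]), p=q[j])) for j in range(n)]
        for a in range(m_i):
            y = list(sample)
            y[i] = a          # player i's own component is overridden
            est[a] += u_i[tuple(y)]
    return est / k

rng = np.random.default_rng(0)
payoff = np.array([[1.0, 0.0], [0.0, 2.0]])   # hypothetical payoff of player 0
q = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
est = sampled_utility_estimate(payoff, 0, q, k=4000, rng=rng)
# Exact values for comparison: U_0(0, q_1) = 0.25 and U_0(1, q_1) = 1.5.
```

Each utility evaluation inside the loop involves only pure actions, so the cost per round is proportional to $k_t$ times the number of actions rather than to the size of the joint action space.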


B. Convergence Result 

The following result ([9], Theorem 5) shows that under A.1 Sampled FP achieves NE
learning in the same sense as classical FP for a special class of potential games known as
identical interests games [14], in which all players use an identical utility function; i.e.,
$u_i(y) = u_j(y)$, $\forall y \in Y$, $\forall i, j \in N$.

^4 Since $a^s_{-i}(t)$ is a pure strategy, the evaluation of the utility is relatively simple. Note also that it is not necessary to draw a
separate sequence of “test actions” for each player $i$, although doing so does not affect the convergence result.

Theorem 1 (Theorem 5 in [9]). Let $\Gamma$ be a finite game in strategic form with identical payoffs.
Then, any Sampled FP process with sample sizes satisfying A.1 converges in beliefs to equilibrium
with probability 1. That is, $d(q(t), NE) \to 0$ almost surely as $t \to \infty$.

V. Computationally Efficient Sampled FP 

While Sampled FP does obtain computational savings when compared to classical FP, it may
be considered unsatisfactory in the sense that it requires players to draw a number of samples
each round that grows without bound (see A.1). In this section, we present an adaptation of
Sampled FP in which only one sample need be drawn each round of the repeated play.

A. Algorithm Setup 

In CESFP, players form an estimate of the mixed utility using a recursive
stochastic-approximation-type rule. Similar to Sampled FP, let $\hat{U}_i(\alpha_i, t)$ be the
estimate which player $i$ maintains of the mixed utility $U_i(\alpha_i, q_{-i}(t))$ for each of her
actions $\alpha_i \in A_i$. Let $\{\rho(t)\}_{t \geq 1}$ be a deterministic sequence of weights to be
used in the stochastic-approximation-type procedure, and assume:

A.2. The sequence $\{\rho(t)\}_{t \geq 1}$ is such that $0 < \rho(t) \leq 1$ and $\lim_{t \to \infty} \frac{1}{t \rho(t)} = 0$.

Note that, by Lemma 3, A.2 implies that $\sum_{t \geq 1} \rho(t) = \infty$. The Computationally
Efficient Sampled FP algorithm is outlined below.

B. Computationally Efficient Sampled FP Algorithm 
Initialize 

(i) For each $i \in N$, let $a_i(1) \in A_i$ be arbitrary. Initialize the empirical distribution as
$q_i(1) = a_i(1)$, $\forall i$, and initialize the utility estimate as $\hat{U}_i(\alpha_i, 0) = 0$,
$\forall \alpha_i \in A_i$, $\forall i$.

Iterate ($t \geq 1$)

(ii) A single “test action” $a^*(t)$ is drawn as a (statistically independent) random sample from
the distribution $q(t)$, and each player $i$ updates the estimate $\hat{U}_i(\alpha_i, t)$ for each
action $\alpha_i \in A_i$ according to the recursion^5

$$\hat{U}_i(\alpha_i, t) = (1 - \rho(t))\, \hat{U}_i(\alpha_i, t-1) + \rho(t)\, U_i(\alpha_i, a^*_{-i}(t)). \tag{5}$$

(iii) Each player $i$ chooses a next-stage action using the rule (cf. (4)):

$$a_i(t+1) \in \arg\max_{\alpha_i \in A_i} \hat{U}_i(\alpha_i, t). \tag{6}$$

(iv) The empirical distribution for each player $i$ is updated to reflect the action just taken:

$$q_i(t+1) = q_i(t) + \frac{1}{t+1}\big(a_i(t+1) - q_i(t)\big). \tag{7}$$

June 16, 2015 DRAFT
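
The four steps of CESFP can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `cesfp`, the callable utility interface `u(i, y)`, the weight sequence $\rho(t) = t^{-3/4}$, and the small two-route congestion game are all hypothetical choices. Every player starts from action 0, which is one admissible "arbitrary" initialization.

```python
import numpy as np

def cesfp(u, sizes, rounds, rho, rng):
    """Run CESFP. u(i, y) returns player i's utility at joint action y (a tuple).

    sizes[i] is the number of actions of player i; rho(t) is the weight sequence.
    Returns the empirical distributions q(rounds).
    """
    n = len(sizes)
    # Step (i): arbitrary initial actions (here: action 0), zero utility estimates.
    q = [np.eye(m)[0] for m in sizes]
    u_hat = [np.zeros(m) for m in sizes]
    for t in range(1, rounds):
        # Step (ii): a single joint test action drawn from q(t).
        test = [int(rng.choice(m, p=q[j])) for j, m in enumerate(sizes)]
        for i in range(n):
            for a in range(sizes[i]):
                y = tuple(test[:i] + [a] + test[i + 1:])
                # Recursion (5): blend the old estimate with one new sample.
                u_hat[i][a] = (1 - rho(t)) * u_hat[i][a] + rho(t) * u(i, y)
        # Step (iii): best response to the estimate; step (iv): update q as in (7).
        for i in range(n):
            a_next = int(np.argmax(u_hat[i]))
            q[i] = q[i] + (np.eye(sizes[i])[a_next] - q[i]) / (t + 1)
    return q

def congestion_utility(i, y):
    """Hypothetical utility: minus the number of players sharing player i's route."""
    return -float(sum(1 for r in y if r == y[i]))

q = cesfp(congestion_utility, sizes=[2] * 4, rounds=200,
          rho=lambda t: t ** -0.75, rng=np.random.default_rng(0))
```

Each round costs one sample and one pure-action utility evaluation per candidate action, independent of $t$. The chosen $\rho(t)$ satisfies $0 < \rho(t) \leq 1$ and $1/(t\rho(t)) = t^{-1/4} \to 0$.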


C. Discussion 

The main difference between Sampled FP and CESFP is the manner in which players form
estimates of the mixed utility sequence $\{U_i(\alpha_i, q_{-i}(t))\}_{t \geq 1}$, $\alpha_i \in A_i$.
In Sampled FP, players' estimates (see (3)) “start afresh” each round of the repeated play:
information gathered from sampling in the previous round is discarded, and players draw roughly
$\sqrt{t}$ new samples in order to form an estimate of the utility for the current round.

This may be considered an inefficient use of information, since the mixed utility only changes
slightly from one round to the next. In particular, note that the mixed utility $U_i(\alpha_i, \cdot)$
is Lipschitz continuous with some Lipschitz constant $K$, and the increment of the empirical
distribution (7) is bounded as $\|q(t) - q(t-1)\| \leq \frac{M}{t}$ for some constant $M > 0$. Thus,
the increment in the mixed utility is bounded as

$$|U_i(\alpha_i, q_{-i}(t)) - U_i(\alpha_i, q_{-i}(t-1))| \leq KM/t. \tag{8}$$

Intuitively speaking, this means that if one has an accurate estimate of the mixed utility
$U_i(\alpha_i, q_{-i}(t-1))$ in round $(t-1)$, then it is wasteful to wholly discard this information
when forming an estimate of $U_i(\alpha_i, q_{-i}(t))$. The CESFP estimation rule leverages the
diminishing increment property (8) in order to form an accurate estimate using only one sample per round.
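
The $M/t$ bound on the increment of the empirical distribution is easy to check numerically for a single player. In the sketch below the actions are drawn at random purely for illustration (the bound holds for any action sequence), the number of actions is hypothetical, and $M = \sqrt{2}$ is used because that is the largest Euclidean distance between two points of a probability simplex.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5                                # hypothetical number of actions
q = np.eye(m)[rng.integers(m)]       # q(1) = a(1)
max_ratio = 0.0
for t in range(2, 2001):
    a = np.eye(m)[rng.integers(m)]   # arbitrary action played in round t
    q_next = q + (a - q) / t         # recursion (7), written for round t
    # ||q(t) - q(t-1)|| = ||a(t) - q(t-1)|| / t <= sqrt(2) / t
    max_ratio = max(max_ratio, t * np.linalg.norm(q_next - q))
    q = q_next
```

After the loop, `max_ratio` holds the largest observed value of $t\,\|q(t) - q(t-1)\|$, which stays below $\sqrt{2}$.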


^5 Since $a^*_{-i}(t)$ is a pure strategy, the evaluation of the utility is relatively simple. Also note that it is not necessary for each
player $i$ to draw a separate “test action” $a^*(t)$; although, if desired (for example, in a distributed setting [23]), doing so
does not affect the convergence result.

Effectively, the Sampled FP estimation rule treats $\{U_i(\alpha_i, q_{-i}(t))\}_{t \geq 1}$ as if it
were arbitrarily generated from one round to the next, drawing a completely new set of $Ct^{\gamma}$,
$\gamma > 1/2$, samples to estimate each $U_i(\alpha_i, q_{-i}(t))$. The CESFP estimation rule, on
the other hand, treats $\{U_i(\alpha_i, q_{-i}(t))\}_{t \geq 1}$ as if it were quasi static, drawing
one sample per round, and taking a type of average over time.^6 Because of this, despite drawing
only one sample per round, the CESFP estimate of $U_i(\alpha_i, q_{-i}(t))$ effectively utilizes
information from $t$ samples, while the Sampled FP estimate utilizes information from (only)
$Ct^{\gamma}$, $\gamma > 1/2$, samples. In practice, the CESFP and Sampled FP estimation rules tend
to reduce estimation error at comparable rates. See Section VI for more details.


D. Main Result 

The following theorem states that CESFP achieves learning in the same sense as classical FP
(and Sampled FP). The result is stated for a slightly broader class of games than previously
discussed, including two-player zero-sum games [?] and generic $2 \times m$ games [35].

Theorem 2. Let $\Gamma$ be a potential game, zero-sum game, or generic $2 \times m$ game. Let
$\{a(t)\}_{t \geq 1}$ be a Computationally Efficient Sampled FP process, and assume A.2 holds. Then
$\lim_{t \to \infty} d(q(t), NE) = 0$ (a.s.).

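Proof: We will prove the result by showing that there exists a sequence $\{\epsilon_t\}_{t \geq 1}$
with $\lim_{t \to \infty} \epsilon_t = 0$ such that
$U_i(a_i(t+1), q_{-i}(t)) \geq \max_{\alpha_i \in A_i} U_i(\alpha_i, q_{-i}(t)) - \epsilon_t$.
By [20], Corollary 5, this is sufficient to guarantee $\lim_{t \to \infty} d(q(t), NE) = 0$.

Since, by (6), $a_i(t+1) \in \arg\max_{\alpha_i \in A_i} \hat{U}_i(\alpha_i, t)$, it is sufficient
to show that, for every $i \in N$ and every $\alpha_i \in A_i$,

$$|\hat{U}_i(\alpha_i, t) - U_i(\alpha_i, q_{-i}(t))| \to 0 \text{ as } t \to \infty \text{ a.s.} \tag{9}$$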
(Note that the individual action spaces $A_i$ are finite.) We will show this by invoking the
result of Lemma 2 (see appendix). To that end, fix $i \in N$ and $\alpha_i \in A_i$, and let
$X(t) := U_i(\alpha_i, a^*_{-i}(t))$, $t \geq 1$, $\mu(t) := U_i(\alpha_i, q_{-i}(t))$, $t \geq 1$,
and $\hat{\mu}(t) := \hat{U}_i(\alpha_i, t)$, $t \geq 0$. For $t \geq 0$, let
$\mathcal{F}_t := \sigma(\{q_{-i}(s)\}_{s=1}^{t+1})$. Note that $\mu(t)$ is
$\mathcal{F}_{t-1}$-measurable and that $\mathbb{E}(X(t) \mid \mathcal{F}_{t-1}) = \mu(t)$.

^6 Additional insight may be gained by considering the dynamical systems approach to stochastic approximations (e.g. [?]),
which allows one to study the behavior of certain discrete-time processes by analyzing an associated differential equation. In such
an analysis, an estimation rule such as (5) is often considered as a two time-scale system [26], [?], with the ODE associated
with the estimation rule operating at a faster rate than the ODE associated with the mixed utility process. In the asymptotic analysis
of such systems, the slower process may often be treated as effectively static compared to the faster process. We note, however,
that the proofs of our results rely primarily on self-contained martingale-type arguments, rather than invoking results from the
dynamical-systems-based treatment in the stochastic approximation literature.

In order to invoke Lemma 2 it is sufficient to show that

$$\frac{\mu(t) - \mu(t-1)}{\rho(t)} \to 0. \tag{10}$$

Let $M := \max_{q'_{-i}, q''_{-i} \in \prod_{j \neq i} \Delta_j} \|q'_{-i} - q''_{-i}\|$ and note
that by (7), $\|q_{-i}(t) - q_{-i}(t-1)\| \leq \frac{M}{t}$. The utility function $U_i$ is
multilinear, and hence Lipschitz continuous, so there exists a constant $K$ such that

$$|\mu(t) - \mu(t-1)| = |U_i(\alpha_i, q_{-i}(t)) - U_i(\alpha_i, q_{-i}(t-1))| \leq K \|q_{-i}(t) - q_{-i}(t-1)\| \leq KM/t.$$

This, together with A.2, implies that (10) holds.

Thus, $X(t)$, $\mu(t)$, and $\mathcal{F}_t$ as defined above fit the template of Lemma 2. By Lemma 2,
$|\hat{U}_i(\alpha_i, t) - U_i(\alpha_i, q_{-i}(t))| = |\hat{\mu}(t) - \mu(t)| \to 0$ as
$t \to \infty$, verifying that (9) holds. ■

VI. Simulation Results 

In order to demonstrate the computational properties of CESFP and Sampled FP in large
games, we simulated both algorithms in a simple traffic routing scenario. Let $N = \{1, \ldots, n\}$
denote a finite set of drivers (or players). Drivers share a common starting point and a common
destination and may travel on one of 50 parallel routes. Let the set of routes be denoted by $R$,
and let the action space of player $i$ be given by $Y_i = R$, $\forall i$. Let $\sigma_r(y)$ denote
the number of drivers on route $r$ given the joint strategy $y$. Each route $r \in R$ has an
associated cost function $c_r : \mathbb{N} \to \mathbb{R}$ signifying the delay experienced on route
$r$ given the number of drivers using the route. Let the utility function of player $i$ be given by
$u_i(y) := -c_{y_i}(\sigma_{y_i}(y))$. We note that this game is an instance of a congestion game,
a known subset of potential games.
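
The routing utility can be written down directly. The sketch below uses a hypothetical instance with three routes and five drivers (not the 50-route, 1000-driver scenario of the simulation) and hypothetical linear delay functions. It also checks the potential-game property for one unilateral deviation using the standard Rosenthal potential for congestion games, which is brought in here as an illustration and is not taken from the text above.

```python
from collections import Counter

def route_loads(y):
    """sigma_r(y): number of drivers on each route under joint action y."""
    return Counter(y)

def utility(i, y, cost):
    """u_i(y) = -c_r(sigma_r(y)) with r = y[i]; cost[r] maps a load to a delay."""
    r = y[i]
    return -cost[r](route_loads(y)[r])

def potential(y, cost):
    """Rosenthal potential: minus the sum over routes of c_r(1) + ... + c_r(load)."""
    loads = route_loads(y)
    return -sum(cost[r](k) for r, load in loads.items() for k in range(1, load + 1))

# Hypothetical instance: 3 routes with linear delays, 5 drivers.
cost = {0: lambda k: 1.0 * k, 1: lambda k: 2.0 * k, 2: lambda k: 0.5 * k + 3.0}
y = (0, 0, 1, 2, 0)
u0 = utility(0, y, cost)   # route 0 carries 3 drivers, delay 3.0, utility -3.0
u2 = utility(2, y, cost)   # route 1 carries 1 driver, delay 2.0, utility -2.0

# Driver 0 deviates from route 0 to route 1; the change in her utility
# equals the change in the potential.
y_dev = (1, 0, 1, 2, 0)
du = utility(0, y_dev, cost) - u0
dphi = potential(y_dev, cost) - potential(y, cost)
```

In this instance both `du` and `dphi` equal -1.0, consistent with the defining identity of a potential game.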

We simulated Sampled FP and CESFP in this routing scenario with 1000 drivers. In the
simulation, Sampled FP used a sample rate of $k_t$ = [...] samples per round and CESFP used
the parameter $\rho(t)$ = [...]. Figure 1(a) shows the wall clock running time through iteration
$t$ for each algorithm.

Figure 1(b) shows a logarithmic plot of the expected total travel time if the current-iteration
empirical distribution $q(t)$ were to be used as the joint mixed strategy. While a Nash equilibrium
of the traffic routing game does not necessarily minimize total travel time, the trend shown in
Figure 1(b) is consistent with convergence of $q(t)$ to NE, and suggests a comparable convergence
rate (per repeated-play iteration) for both algorithms.

Fig. 1. (a) Wall clock evaluation time for each algorithm; and (b) Expected total travel time for the mixed strategy q(t).

VII. Conclusions 

The classical Fictitious Play (FP) algorithm can be prohibitively difficult to implement in games
with many players. Sampled FP [9], a Monte-Carlo based variant of FP, has previously been
proposed as a method of mitigating computational complexity in large-scale implementations of
FP. Though Sampled FP does reduce complexity, it suffers from the drawback
that the number of samples that must be gathered in each stage of the algorithm grows without
bound.

The paper proposed Computationally Efficient Sampled FP (CESFP), a variant of Sampled
FP that requires only one sample to be drawn per stage of the algorithm. CESFP is shown to
achieve Nash equilibrium learning in the same sense as FP. A simulation example was used to
demonstrate the computational properties of CESFP compared to Sampled FP. The simulation
example used a game with fairly simple structural properties. An interesting future research
direction may be to study the relative empirical performance of Sampled FP and CESFP in
games with more complex structure (e.g., [10]).

Appendix 

Lemma 1. Let $\{x_t\}_{t \geq 0}$ satisfy $x_t \to x$ as $t \to \infty$. Let $\{\rho_t\}_{t \geq 0}$ satisfy $0 < \rho_t \leq 1$ and $\sum_{t \geq 0} \rho_t = \infty$. Then, the
sequence $\{y_t\}_{t \geq 0}$ given by $y_t = (1 - \rho_t) y_{t-1} + \rho_t x_t$, $t \geq 1$, satisfies $y_t \to x$ as $t \to \infty$.

Proof: The result follows from Toeplitz's lemma [?]. ■

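Lemma 2. Let $\{\rho_t\}_{t \geq 1}$ satisfy $0 < \rho_t \leq 1$, $\sum_{t \geq 1} \rho_t = \infty$ and $\sum_{t \geq 1} \rho_t^2 < \infty$. Let $\{\mathcal{F}_t\}_{t \geq 1}$ be a filtration and let
$\{X_t\}_{t \geq 1}$ be a sequence of bounded random variables, adapted to the filtration, say, $|X_t| \leq B$. Let $\mu_t = \mathbb{E}(X_t \mid \mathcal{F}_{t-1})$
and assume that $(\mu_t - \mu_{t-1})/\rho_t \to 0$ almost surely. Then, the sequence of random variables $\{\hat{\mu}_t\}_{t \geq 0}$ given
by $\hat{\mu}_t = (1 - \rho_t)\hat{\mu}_{t-1} + \rho_t X_t$, $t \geq 1$, satisfies $|\hat{\mu}_t - \mu_t| \to 0$ almost surely.

Proof: Subtracting $\mu_t$ from both sides of $\hat{\mu}_t = (1 - \rho_t)\hat{\mu}_{t-1} + \rho_t X_t$ gives

$$E_t = (1 - \rho_t) E_{t-1} + \rho_t \Big( X_t - \mu_t + \frac{1 - \rho_t}{\rho_t}\, \delta_t \Big),$$

where $E_t := \hat{\mu}_t - \mu_t$ and $\delta_t := \mu_{t-1} - \mu_t$, $\delta_0 := 0$.

Introduce the $\mathcal{F}_t$-adapted sequences, for $t > 1$,

$$F_t = (1 - \rho_t) F_{t-1} + \rho_t (X_t - \mu_t), \quad F_1 = E_1,$$
$$G_t = (1 - \rho_t) G_{t-1} + \rho_t\, \frac{1 - \rho_t}{\rho_t}\, \delta_t, \quad G_1 = 0,$$

and note that $E_t = F_t + G_t$. We will now show that $F_t \to 0$ and $G_t \to 0$ almost surely.

By assumption, $\frac{1 - \rho_t}{\rho_t}\, \delta_t \to 0$ almost surely. Lemma 1 applied to $\{G_t\}_{t \geq 1}$ gives $G_t \to 0$ almost surely.

On the other hand, since $\mathbb{E}(X_t - \mu_t \mid \mathcal{F}_{t-1}) = 0$ and $|X_t - \mu_t| \leq 2B$,

$$\mathbb{E}(F_t^2 \mid \mathcal{F}_{t-1}) = (1 - \rho_t)^2 F_{t-1}^2 + \rho_t^2\, \mathbb{E}\big((X_t - \mu_t)^2 \mid \mathcal{F}_{t-1}\big)$$
$$\leq (1 - \rho_t)^2 F_{t-1}^2 + \rho_t^2\, 4B^2$$
$$= (1 + \rho_t^2) F_{t-1}^2 - 2\rho_t F_{t-1}^2 + \rho_t^2\, 4B^2.$$

Since $\sum_{t \geq 1} \rho_t^2 < \infty$, from the Robbins-Monro Lemma [?] we conclude that, almost surely, $\{F_t^2\}_{t \geq 1}$ converges
and $\sum_{t \geq 1} \rho_t F_{t-1}^2 < \infty$. Since $\sum_{t \geq 1} \rho_t = \infty$, these two properties imply $F_t \to 0$ almost surely. ■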
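
The tracking behavior described by Lemma 2 can be observed in a short simulation. The sketch below is illustrative only: the drifting mean $\mu_t = 0.5 + 0.3 \sin(\log t)$, the Bernoulli samples, and the weights $\rho_t = t^{-3/4}$ are hypothetical choices. They satisfy the hypotheses because $|\mu_t - \mu_{t-1}| \leq 0.3/(t-1)$, so $(\mu_t - \mu_{t-1})/\rho_t \to 0$, while $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20000
mu_hat = 0.0
for t in range(1, T + 1):
    mu_t = 0.5 + 0.3 * np.sin(np.log(t))   # slowly drifting conditional mean
    x_t = float(rng.random() < mu_t)        # bounded sample with E[X_t | past] = mu_t
    rho_t = t ** -0.75                      # hypothetical weights: sum = inf, sum of squares < inf
    # One-sample recursive estimate, the same form as recursion (5).
    mu_hat = (1 - rho_t) * mu_hat + rho_t * x_t
error = abs(mu_hat - mu_t)                  # tracking error at the final round
```

The estimate uses one sample per step yet follows the moving target, which is the mechanism CESFP relies on.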

Lemma 3. Let $\{\rho_t\}_{t \geq 1}$ be such that $0 < \rho_t \leq 1$ and $\lim_{t \to \infty} \frac{1}{t \rho_t} = 0$. Then $\sum_{t \geq 1} \rho_t = \infty$.

Proof: We claim there exists $c, T > 0$ such that $\rho_t \geq c \frac{1}{t}$ for all $t \geq T$. If this were not so, then for every
$c > 0$ there would hold $\rho_t < c \frac{1}{t}$ infinitely often, which would imply $\frac{1}{t \rho_t} > \frac{1}{c}$ infinitely often, contradicting the
hypothesis that $\lim_{t \to \infty} \frac{1}{t \rho_t} = 0$.

Thus, there exists $c, T > 0$ such that $\rho_t \geq c \frac{1}{t}$, $\forall t \geq T$, and hence $\sum_{t \geq 1} \rho_t \geq \sum_{t \geq T} \rho_t \geq c \sum_{t \geq T} \frac{1}{t} = \infty$. ■